The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course, both to learn the statistical concepts discussed in class and also to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface. This is going to feel very unfamiliar. The more questions you ask me, the quicker you will get up to speed on the software. These labs will build on each other. You will want to refer back to previous labs often to remind yourself how to perform certain tasks.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with an introduction to some of the fundamental building blocks of R and RStudio: the interface, reading in data, basic commands, data types, and visualization.
Open RStudio and create a new Markdown document (File|New File|R Markdown, or use the icon dropdown menu on the upper left). Save this to wherever you are going to keep your labs. I suggest having a separate folder for each lab, since multiple files are generated. When you Knit, you might get an error about packages; just install any that are needed and ask for help if you get further errors. You can, and should, delete all the pre-loaded text and code UNDER the first R chunk (the setup chunk should stay).
The lab contains “Exercises”, and you should make a separate header for each exercise (e.g. type ## Exercise 1). So your Markdown document will typically have the following flow:
Not all exercises will require that you write text, and a few won’t require that you include R code. But if you need output from R to answer a question, that code must be included in your Markdown document.
The panel in the upper right contains your workspace (aka environment) as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.
The panel on the lower left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
You can use R as a calculator. To get you started, enter the following command at the R prompt (i.e. right after > on the console). You can either type it in manually or copy and paste it from this document.
2+2
And you can save this result to an object that you can access later:
x <- 2 + 2
The arrow <- is called an ASSIGNMENT OPERATOR, and tells R to save an object called x that has the value of 4. This is similar to saving a value in a graphing calculator. There is a keyboard shortcut for this; it’s easy to use Google to find keyboard shortcuts (they will be different for Macs vs. PCs). There is also a keyboard shortcut menu in the Help menu in RStudio. I would recommend learning the shortcut for <- first; it’s annoying to type and you’ll use it constantly.
Note that whatever name you want to save your object as must always be to the left of the assignment operator. You can also see this new object in your environment on the upper right pane.
Try typing x in the console to verify its value.
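Here is a small sketch of how assignment behaves. Only x comes from the lab; the object name y is a made-up name added for illustration.

```r
# Store the result of a calculation in an object called x
x <- 2 + 2

# Typing the name of an object prints its value
x

# A stored object can be reused in later calculations
# (y is a hypothetical name used only for this illustration)
y <- x * 10
y
```

After running these lines, both x and y should appear in the Environment pane.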
Throughout the semester you will learn about how to use R to do data analysis, and in the meantime you will be exposed to some programming. In addition, you will learn best practices for saving your code and making sure that your analysis is reproducible.
When you want to write a paper, you might open a Word document to type your ideas into, and save your work in. In RStudio we use a document type called an R Markdown document. R Markdown documents are useful for both running code and annotating the code with comments. The document can be saved, so you can refer back to your code later, and can be used to create other document types (html, word, pdf, or slides) for presenting the results of your analyses. R Markdown provides a way to generate clear and reproducible statistical analyses. In an R Markdown document, you can write text just like in Word, but you can also put ‘chunks’ of code in the document. Then you will ‘Knit’ the Markdown file to create a document with your text and it will also run your code and include the results in the document.
You’ll need to figure out whether code is needed to answer a particular question, and if so a new chunk of code can be inserted by clicking on the Insert button and choosing R from the dropdown menu. Again, there is a keyboard shortcut for this.
If you have pop-ups blocked on your laptop, you may see a box come up warning you when you Knit. Just click Try Again and you should see the results. You can also choose to see your output in the Viewer tab. You can change this setting by clicking the gear icon next to the Knit button. As you go along, you will discover that you have certain preferences (e.g. you might always want your knitted documents to display in the Viewer tab rather than a separate window). Tools|Global Options gives you a lot of ways to customize your RStudio experience that will hold for all R sessions (so you don’t have to change options manually each time you make a new Markdown document). I would suggest spending a few minutes exploring this menu; ask me if there are any preferences you’d like to change and you don’t see how to do that.
R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:
tidyverse: for data wrangling and visualization
openintro: for the datasets used in the R labs in this course
Even if these packages have already been installed on your laptop, you will still need to load them in your working environment in order to use them. To do so, type the following in the console (since you have opened a Markdown document, you might need to click on the Console tab at the bottom left of your screen and resize the window to see the Console better; you should experiment with resizing the four panes to your liking):
library(tidyverse)
library(openintro)
Note that these lines of code need to also appear at the top of your R Markdown document (I often put them in the code chunk labeled ‘setup’ that is automatically inserted when you make a new Markdown document). We need to load the packages both in the console and in your R Markdown document since these two environments work independently of each other.
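If you are not sure whether the packages are installed yet, a check along the following lines may help. This is only a sketch: the object names needed, installed, and not_installed are made up for illustration, and the install step is left commented out because it needs an internet connection.

```r
# Packages this lab relies on
needed <- c("tidyverse", "openintro")

# requireNamespace() returns TRUE if a package is installed and can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install (needs internet)
}
```

Installing a package is a one-time step, while library() has to be run in every new session and in every Markdown document that uses the package.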
A note on the difference between the console and your Markdown document: Think of the console as your sandbox. You can try out code there and see what happens. But your final product is your Markdown document, and they are separate. When R Knits, it creates a new environment, starts at the top of your Markdown document, and runs all the code in order. So any code you need must be in your Markdown document. That’s also why, when you want to play around in the console (your sandbox), you have to make sure you have everything loaded there as well (like your packages). Fortunately, you can tell R to run code from your Markdown document INSIDE the console, so you don’t have to type everything twice. We’ll get to that next.
To get you started, make a new code chunk and type and run the following command from your Markdown file.
data("arbuthnot")
You can do this by
Think of “running code” in your console as telling R to “do this now”.
This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables.
A note on Promise: When you load this dataset, you will see the word ‘Promise’ show up in the environment. This is because R doesn’t want to take up memory until you actually need to DO something with this dataset. If you double-click the name ‘arbuthnot’, the data will open in a Viewer AND you’ll see the word ‘Promise’ change to the number of observations and variables.
This dataset is contained in the openintro library. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.
arbuthnot
An advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the tab above the data.
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.
Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame (or tibble).
You can see the dimensions of this data frame by typing (in the console):
dim(arbuthnot)
This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we’ll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing (in the console):
names(arbuthnot)
You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like math functions; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
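To see the "function with arguments" idea without loading any packages, here is a sketch that uses a tiny stand-in data frame. The name toy is made up for illustration. Its first row uses the 1629 counts quoted in this lab, and the other two rows are placeholder values, not real data.

```r
# A tiny stand-in for arbuthnot so the example runs without any packages.
# Row 1 uses the 1629 counts quoted in this lab; rows 2 and 3 are made-up
# placeholder values, NOT real data.
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 5000, 4800),
  girls = c(4683, 4500, 4900)
)

dim(toy)    # number of rows, then number of columns
names(toy)  # column (variable) names
nrow(toy)   # some functions return just one of the dimensions
```

Each call has the same shape: the function name, then its argument in parentheses.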
Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like
arbuthnot$boys
This command will only show the number of boys baptized each year. The dollar sign basically says “go to the data frame that comes before me, and find the variable that comes after me”. It’s a way to reference a particular column.
Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector (for the boys). And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
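The bracketed numbers are easier to see with a quick experiment. This sketch again uses a made-up stand-in data frame called toy (first row from the 1629 counts in this lab, other rows placeholders), and boys_vec is a hypothetical name.

```r
# Stand-in data: row 1 is from the lab, rows 2 and 3 are made-up placeholders
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 5000, 4800),
  girls = c(4683, 4500, 4900)
)

# The dollar sign pulls one column out of the data frame as a vector
boys_vec <- toy$boys
boys_vec

# Square brackets pick out elements by position
boys_vec[1]     # the first element
boys_vec[2:3]   # the second and third elements
length(boys_vec)

# A longer vector wraps across several lines when printed; each line
# starts with the position of its first element in brackets
101:140
```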
Technically, a vector is a data object that has multiple elements of the same type. So far, you have created one vector called x that has one element in it. That element’s type is numeric.
Copy, paste and run the following command into the console.
v <- c(2, 4, 6)
This vector contains three numbers, 2, 4, and 6. The c() function tells R to concatenate the values 2, 4, and 6 into a single vector. Note in the Environment pane that your vector v contains numbers (listed as num). In other words, its type is numeric.
You can do math on a vector that contains numbers! For instance, copy, paste and run the following command into the console. This tells R to multiply each element of the vector v by 3.
v * 3
You can also make vectors of characters (words or strings). Copy, paste and run the following command into a new code chunk. This vector has the name char.vec and contains 3 elements, all of which are designated as characters (or chr in the Environment pane). It doesn’t matter that “2” is a number; putting the elements in quotes tells R that they are all character data types, not numeric.
char.vec <- c("2", "Wheaton", "red")
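If you want to check a vector’s type from code rather than from the Environment pane, class() reports it. The sketch below reuses v and char.vec from above; the name mixed is made up for illustration.

```r
v <- c(2, 4, 6)
char.vec <- c("2", "Wheaton", "red")

class(v)         # "numeric"
class(char.vec)  # "character"

# Arithmetic works element by element on a numeric vector
v * 3            # 6 12 18

# A vector holds one type only: mixing numbers and text in c()
# turns every element into character data
mixed <- c(2, "red")
class(mixed)     # "character"

# as.numeric() converts text that looks like a number back into a number
as.numeric(char.vec[1]) + 1   # 3
```

Trying char.vec * 3 instead gives an error, because R will not do arithmetic on character data.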
Create a vector called new.vec which contains your full name, with each word as a separate element of the vector.
R has some powerful functions for making graphics. We will use the ggplot function for data visualization. Its first argument is the data you’re visualizing. Next we define the aesthetic mappings (aes). In other words, the columns of the data that get mapped to certain aesthetic features of the plot, e.g. the x axis will represent the variable called year and the y axis will represent the variable called girls. Then, we add another layer to this plot where we define which geometric shapes (geom) we want to use to represent each observation in the data. In this case we want these to be points, hence we use geom_point.
ggplot(data = arbuthnot, mapping = aes(x = year, y = girls)) +
geom_point()
If this seems like a lot, it is. And you will learn about the philosophy of building data visualizations in layers in detail soon. For now, follow along with the code that is provided.
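To make the idea of layers a little more concrete, here is a sketch that builds a plot one piece at a time. It assumes the ggplot2 package (part of the tidyverse) is installed, and it uses a made-up stand-in data frame called toy so it runs without openintro. The names p, p_points, and p_labeled are hypothetical.

```r
library(ggplot2)  # ggplot2 is one of the packages loaded by the tidyverse

# Stand-in data: row 1 is from the lab, rows 2 and 3 are made-up placeholders
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 5000, 4800),
  girls = c(4683, 4500, 4900)
)

# Step 1: the data and the aesthetic mappings (which column goes on which axis)
p <- ggplot(data = toy, mapping = aes(x = year, y = girls))

# Step 2: add a layer that draws one point per observation
p_points <- p + geom_point()

# Step 3: further pieces, such as axis labels, are added with + in the same way
p_labeled <- p_points + labs(x = "Year", y = "Girls baptized")

# Printing the plot object draws it
print(p_labeled)
```

Printing p on its own shows empty axes, which is a handy way to see that the geom layer is what actually draws the data.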
Change the look of your report:
Click on the gear icon at the top of the R Markdown document, and select “Output Options…” in the dropdown menu. In the General tab of the pop-up dialogue box, try out different Syntax highlighting and theme options. Hit OK and Knit your document to see how it looks. Play around with these until you’re happy with the look.
Getting help:
R extensively documents all of its functions; to read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in.
Try the following:
?dim
Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
Tip: If you use the up and down arrow keys, you can scroll through your previous commands in the console, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
Use the up arrow key to recall your plotting command and change geom_point to geom_line, so that instead of having to type the entire line over again you can use the previously run code. Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)
Now, suppose we want to plot the total number of baptisms. We can type in mathematical expressions like
5218 + 4683
to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously. Type the following in the console.
arbuthnot$boys + arbuthnot$girls
What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right. If you add two vectors of numbers together that are the exact same size (i.e. they have the same number of elements), R will add component-wise. In other words, the first elements of each vector will be added together, the second elements of each vector will be added together, etc. You will get a new vector of the same size (or length) as the original two, but with all the sums.
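Component-wise addition is easy to verify on a short vector. In this sketch the first value in each vector is the 1629 count from the lab, and the remaining values are made-up placeholders.

```r
# First value in each vector is from the lab; the others are made up
boys  <- c(5218, 5000, 4800)
girls <- c(4683, 4500, 4900)

# R adds the first elements together, then the second, then the third
total <- boys + girls
total   # 9901 9500 9700

# The first sum matches the calculation done by hand
total[1] == 5218 + 4683

# The result has the same length as the two vectors that were added
length(total) == length(boys)
```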
We’ll be using this new vector to generate some plots, so we’ll want to save it as a permanent column in our data frame. Type the following in the console.
arbuthnot <- arbuthnot %>% mutate(total = boys + girls)
What in the world is going on here? The %>% operator is called the piping operator. Basically, it takes whatever is to its left and pipes it into the first argument of the function on its right. The %>% operator also has a keyboard shortcut you might want to learn earlier rather than later. We will use it a lot.
A note on piping: Note that we can read this code as the following:
“Take the arbuthnot dataset and pipe it (as the first argument) into the mutate function. Use mutate to create a new variable called total that is the sum of the variables called boys and girls. Then assign the resulting dataset to the object called arbuthnot, i.e. overwrite the old arbuthnot dataset with the new one containing the new variable.”
This is essentially equivalent to going through each row and adding up the boys and girls counts for that year and recording that value in a new column called total.
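One way to convince yourself of what the pipe does is to write the same step with and without it. This sketch assumes the dplyr package (part of the tidyverse) is installed and uses a made-up stand-in data frame called toy; piped and nested are hypothetical names.

```r
library(dplyr)  # dplyr is part of the tidyverse; it provides mutate() and %>%

# Stand-in data: row 1 is from the lab, rows 2 and 3 are made-up placeholders
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 5000, 4800),
  girls = c(4683, 4500, 4900)
)

# With the pipe: toy is passed in as the first argument of mutate()
piped <- toy %>% mutate(total = boys + girls)

# Without the pipe: the same call written out in full
nested <- mutate(toy, total = boys + girls)

identical(piped, nested)  # the two results are the same
piped$total               # 9901 9500 9700

# toy itself has no total column, because neither result was assigned back to toy
names(toy)
```

That last point is why the lab’s command starts with arbuthnot <-: without the assignment, the new column would be printed and then lost.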
Where is the new variable? When you make changes to variables in your dataset, click on the name of the dataset again (in the Environment Tab) to update it in the data viewer.
You’ll see that there is now a new column called total that has been tacked on to the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
We can make a plot of the total number of baptisms per year with the command (type all code below in the console)
ggplot(data = arbuthnot, mapping = aes(x = year, y = total)) +
geom_line()
Similarly to how we computed the total number of baptisms, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with
5218 / 4683
or we can act on the complete columns with the expression
arbuthnot <- arbuthnot %>% mutate(boy_to_girl_ratio = boys / girls)
We can also compute the proportion of newborns that are boys in 1629
5218 / (5218 + 4683)
or we can compute this for all years simultaneously and append it to the dataset:
arbuthnot <- arbuthnot %>% mutate(boy_ratio = boys / total)
Note that we are using the new total variable we created earlier in our calculations.
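The ratio and the proportion are easy to mix up, so here is a short sketch using only the 1629 counts from above. The object names are made up for illustration.

```r
boys_1629  <- 5218
girls_1629 <- 4683

# Ratio: boys per girl
ratio <- boys_1629 / girls_1629
round(ratio, 3)       # 1.114

# Proportion: boys as a share of all baptisms
proportion <- boys_1629 / (boys_1629 + girls_1629)
round(proportion, 3)  # 0.527

# The two are linked: proportion = ratio / (1 + ratio)
all.equal(proportion, ratio / (1 + ratio))
```

A ratio above 1 and a proportion above 0.5 say the same thing: more boys than girls were baptized that year.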
Now make a plot of the proportion of boys born over time. Since you are using the total column to make a new variable, you need to have put the code to make the total variable in your Markdown document. It doesn’t matter that you’ve already done this in the console. Knit to make sure everything is working. What do you see? Enter your answer in your Markdown document underneath your code chunk.
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression
arbuthnot <- arbuthnot %>% mutate(more_boys = boys > girls)
This command adds a new variable to the arbuthnot data frame containing the value TRUE if that year had more boys than girls, or FALSE if that year did not (the answer may surprise you). This variable contains a different kind of data than we have considered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
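Here is a small sketch of what logical data looks like and one thing you can do with it. The vectors are made up for illustration (only the first value in each is from the lab), so the result says nothing about the real Arbuthnot data.

```r
# First value in each vector is from the lab; the others are made up
boys  <- c(5218, 5000, 4800)
girls <- c(4683, 4500, 4900)

# A comparison is done element by element and returns TRUE or FALSE for each
more_boys <- boys > girls
more_boys         # TRUE TRUE FALSE
class(more_boys)  # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(more_boys)    # 2

# == tests for equality (a single = would mean something different)
boys == girls     # FALSE FALSE FALSE
```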
There has been some confusion about this section in the past, so just to be clear, yes, these “More exercises” are assigned and part of the grade.
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. Load up the present day data with the command below. Any code you use to answer the questions below should be entered into your R Markdown file by first inserting code chunks and then typing the code you need inside the chunks. Any answers to questions should be entered OUTSIDE the code chunks. Please enter all code and text underneath the respective question number headings.
data(present)
The data are stored in a data frame called present.
What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
How do these counts compare to Arbuthnot’s? Are they on a similar scale? Why or why not?
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from an exercise above, just replace the dataframe name.
In what year did we see the greatest total number of births in the U.S.? Hint: Sort your dataset in descending order based on the total column you created for the previous question. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted results in your report you will need to use two new functions: arrange (for sorting) and desc (for descending order). Sample code provided below.
present %>% arrange(desc(total))
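If you want to see what arrange and desc do before trying them on the real data, the sketch below sorts a made-up stand-in data frame. It assumes the dplyr package (part of the tidyverse) is installed; toy and toy_sorted are hypothetical names, and only the first row of values comes from the lab.

```r
library(dplyr)  # dplyr is part of the tidyverse; it provides arrange() and desc()

# Stand-in data: row 1 is from the lab, rows 2 and 3 are made-up placeholders
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 5000, 4800),
  girls = c(4683, 4500, 4900)
)

# Add the total column, then sort so the largest total comes first
toy_sorted <- toy %>%
  mutate(total = boys + girls) %>%
  arrange(desc(total))

toy_sorted

# The year with the largest total is now in the first row
toy_sorted$year[1]
```

Leaving out desc() sorts from smallest to largest instead.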
Complete the following with code in a code chunk (no text necessary). Remember that the code is just instructions for R.
Create a variable y with the value of 7 and a variable x with a value of 8.
[...] x by y, and store the answer in a variable named z.
Compute 6 + 3.
Save 6 + 3 as a variable called a.
Your file will save automatically when you Knit, and your final lab must Knit without error before you submit, so Knit often! You want to figure out if you have errors in your Markdown document as you go along, not at the end.
Now let’s practice some basic formatting. Using this formatting tips page, figure out how to put the following into your lab report. These can all be typed into the white section, where text goes. Hint: To put each of these on its own line, hit a hard return (an extra one) between each line of text.
These data come from reports by the Centers for Disease Control listed in the references section.
When you are finished with the lab, go to the very top and change the output from html_document to pdf_document. The PDF document doesn’t look as nice, but it is easier to grade and upload to Schoology. Now turn in this PDF file to Schoology. Note the due date and time. If Schoology says it’s late, it’s late. Make sure your final Markdown document Knits properly and shows all your work. Look through it to make sure everything looks organized and professional. Also remember that if you needed output (graphs, numeric output, etc.) to answer a question, the code to generate that output needs to be in the lab report. Other code should not be included.